Ising Models for Inferring Network Structure 

From Spike Data 



J. A. Hertz 

Niels Bohr Institute, Blegdamsvej 17, 2100 Copenhagen 0, Denmark 
NORDITA, Roslagstullsbacken 23, 10691 Stockholm, Sweden 

Yasser Roudi 

Kavli Institute for Systems Neuroscience, NTNU, Trondheim, Norway 
NORDITA, Roslagstullsbacken 23, 10691 Stockholm, Sweden 

Joanna Tyrcha 

Mathematical Statistics, Stockholm University, S106 91 Stockholm, Sweden 



Abstract 

Now that spike trains from many neurons can be recorded simultaneously, 
there is a need for methods to decode these data to learn about the networks 
that these neurons are part of. One approach to this problem is to adjust the 
parameters of a simple model network to make its spike trains resemble the 
data as much as possible. The connections in the model network can then 
give us an idea of how the real neurons that generated the data are connected 
and how they influence each other. In this chapter we describe how to do 
this for the simplest kind of model: an Ising network. We derive algorithms 
for finding the best model connection strengths for fitting a given data set, 
as well as faster approximate algorithms based on mean field theory. We 
test the performance of these algorithms on data from model networks and 
experiments. 
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1 Introduction 

Now that we can record the spike trains of large numbers of neurons 
simultaneously, we have a chance, for the first time in the history of neu- 
roscience, to start to understand how networks of neurons work. But how 
are we to proceed, once we have such data? In this chapter, we will review 
some ideas we have been developing. The reader will recognize that we are 
only describing the very first steps in a long journey. But we hope that they 
will help point the way toward real progress some time in the not-too-distant 
future. 

What is it we want to learn from multi-spiketrain data? Here we will 
focus on finding the connectivity in the network under study. We hope that 
ultimately this can help us understand the dynamics of the network and 
how it implements computations. For example, in the hippocampus and 
entorhinal cortex, many interesting kinds of neurons (place cells, grid cells, 
etc.) have been identified, but we do not know how they interact. If we could 
find out how they are connected, we might be able to get a much better idea 
of how they compute the animal's estimate of its location. 

Of course, we will not be able to find the full connectivity unless we can 
record from all neurons, which is impossible, at least with any current tech- 
niques. Several orders of magnitude separate the number of electrodes in 
a recording array and the number of neuron in the area it covers, so any 
apparent effect of one neuron on another might actually be mediated by un- 
observed neurons. Thus, we (and everyone else in the field) are forced into 
using terms like "effective connections" or "functional connectivity." In this 
chapter, for brevity, we will generally just write "connections" or "connec- 
tivity", but we will always mean that they are "effective" or "functional" 
connections. 

Given this fundamental limitation on our project, it is nevertheless worth 
remarking that while all connections are functional one, some kinds are more 
functional than others. Modeling approaches that are closer to biophysical 
reality stand a better chance of recovering something like the real synaptic 
connectivity than those which are farther away. The approach we describe 
here is at least a first step, albeit a modest one, in the right direction. 

One can take a rather different attitude, proceeding formally and making 
a purely statistical model: a mathematical object with some parameters that 
characterize the data. However, even if such a model fits the data well, one 
has in general no way to assign biological meaning to the parameters. This 
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is not to say that we don't care about fit quality - we do. It is only to say 
that fitting, for example, equal-time neuronal firing correlations in the data 
perfectly with some model is not very satisfying if the parameters of the 
model cannot be related fairly directly to the connectivity of the network. 

To present the models and formal approaches that we study, we have to 
start with our notation for the data. We work with time-binned spike trains 
under the assumption that firing rates are low enough that there is (almost) 
never more than one spike per bin. Time will be measured in units of the 
bin size. We denote the spike train of neuron i by <%(£), where Si(t) = +1 
if it fires in bin t and Si(t) = — 1 if it does not. (One can equivalently use 
Si = for no spike, but we won't do that here.) Thus we can visualize the 
data as a big array, the "spike matrix" (N x T if there are N neurons and T 
time bins), of +ls and —Is. 

This representation of the data lends itself to formulating the problem in 
terms of a very simple, perhaps the simplest possible model: a McCulloch- 
Pitts network (McCulloch and Pitts, 1943), or what in physics is called an 
Ising model. In this chapter we will use several different kinds of Ising models 
to treat the problem of finding the connections in a network from its spiking 
history. 



2 The Gibbs Equilibrium Model 

A possible formal approach to the problem is this (Schneidman et al, 
2006): One considers the data as the set of columns on the spike matrix, i.e., 
all the spike patterns observed, ignoring their temporal order, and asks what 
distribution they could have been sampled from. Since one is treating all the 
patterns as coming from the same distribution, one is implicitly assuming 
stationarity in the data. If one seeks the distribution with the largest entropy, 
consistent with the sample means rrii = (Si) and correlations ((Si — rrii)(Sj — 
m i))flone finds something with a familiar look (at least if one is a statistical 
physicist): 

P[S] = |exp \Y. J ^AS j + Y.h i S i (1) 

This is the Gibbs equilibrium distribution for the (pairwise) Ising model. 
Its parameters, Jij and hi, are Lagrange multipliers that are used in the 



1 Statisticians would call these covariances, but here we follow the convention in statis- 
tical physics and call them correlations. 
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constrained maximization. The parameter is the strength with which 
unit (neuron) j influences unit i, and the bias "field" hi controls how likely 
unit i is to be +1 ("fire") in the absence of the other units. The expectation 
values of the Si under the distribution (JTJ), denoted m, ("magnetizations" in 
the original context of the model), are related to the neuronal firing rates rj 
(in units of spikes/time bin) through rrti = 2r.; — 1. 

One might like to think of as something like a synaptic strength. 
However, the J matrix in (pQ) is necessarily symmetric, Jjj = Jji. This is true 
for the couplings in magnets, but it is generally not true for synapses, so we 
have to be cautious about what J„ means. 

A few remarks: In physics, the normalizing denominator Z in ([I]) is called 
the partition function, and its log is the negative of the free energy. We are 
everywhere setting the temperature equal to 1, since changing the tempera- 
ture just amounts to rescaling the J^s and his by a constant factor. 

Normally in statistical mechanics, one is given a model and its param- 
eters, and the task is to find the moments (Si), (SiSj), etc., which are the 
quantities that can be measured in experiments. But here we have an inverse 
problem: we are given the measured correlation functions and want to find 
the parameters of the model. 

It has been known for some time how to solve this problem, at least in 
principle. One should adjust the h{ and to maximize the probability ([!]), 
evaluated on the data. Doing this by gradient ascent gives learning rules 



The averages in the second terms are under the model with the current 
values of the J^- and hi, and rj is a learning rate, to be chosen small enough 
to get smooth convergence. To evaluate them exactly, we have to know the 
probabilities (jTJ), which requires evaluating the partition function Z. To do 
this exactly one has to sum 2 N terms, which is feasible for systems up to 
iV m 20. For larger N one has to estimate the averages by Monte Carlo 
simulations of the model. This fact makes the learning slow, especially when 
working with data from long recordings. One wants the estimates of the 
model averages to be as good as the direct data averages in the first terms, 
so the length of each Monte Carlo run has to be about equal to the number 
of time bins in the data set. It may take many iterations to reach stationary 
values of the and hi, so this can be a lot of computing. Just what limit this 




(2) 
(3) 
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places on the size of systems that can usefully be studied this way depends 
on patience and the computing resources available, but in our work we found 
it impractical to try to work with N > 100. 

These learning rules are a special case of what is called Boltzmann learn- 
ing (Ackley et al, 1985), which also covers systems with "hidden", i.e., un- 
recorded units. In this case, the first terms on the right-hand sides of §2§ 
and ([3]), when evaluated for hidden units, are averages over the model with 
the visible units clamped to their measured values. This requires even more 
Monte Carlo runs. 



2.1 Mean- field methods 

It is possible to avoid long Monte Carlo runs using mean field methods. 
Two such methods have been proposed. One is based on heuristic mean-field 
equations first applied to magnetic systems by Weiss more than a century ago. 
The other is based on improved equations first proposed by Onsager. They 
were applied to spin glasses by Thouless, Anderson and Palmer (Thouless et 
al, 1977), so nowadays in statistical mechanics they are called TAP equations. 

Both the original mean-field and TAP equations are approximate partial 
solutions to the forward problem, i.e., calculating the magnetizations mj. 
From them, we will derive corresponding solutions to the inverse problem. 
These solutions are approximate, but become very good in the limits of weak 
coupling or, for dense connections, large N. 

The idea of mean field theory is very simple. The exact rrii, conditional 
on a set of Sj connected to i by the coupling matrix J^-, is the difference of 
Boltzmann probabilities 



p(S i = l\{S j })-p{S i = -l\{S j }) 



■? g^i+X/j JijSj _|_ g hi y\ JijSj 

= tanh (hi + ^ JySA ■ (4) 

The approximation consists of replacing the Sj inside the tanh by their av- 
erages rrii. 

Tdi = tanh ^hi + ^ Jij m j^j • (5) 

This is apparently a good approximation when the sum on j has many terms 
(a loose kind of central-limit argument). This is called the mean field limit. 
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Onsager argued that the contribution to the neighbor magnetizations rrij 
from the central unit Si itself should not be counted in estimating the field 
acting on Si. This leads to corrected mean- field equations, 



m,4 = tanh 



(6) 



TAP showed that these equations, rather than (jSJ), should be used in spin 
glasses, where the Jjj are random, with a zero or very small mean, because 
then the Onsager correction term is of the same order as the naive mean 
field. In this sense, these equations are the correct mean field equations for 
spin glasses. It has become conventional to call them "TAP equations" . We 
will use the abbreviation "nMF" for the naive mean field equations (JSJ). 

Plefka showed that §5$) and are the first two in a sequence of better 
and better approximations that can be derived systematically (Plefka, 1982), 
but we will only consider these first two here. 

It is convenient to write the TAP equations in the form 

hi = tanh -1 m; - Jijirij + J^(l - m]) (7) 

j j 

The matrix 

= = ~ * ~ Ui^i (8) 

is the inverse susceptibility matrix. In equilibrium statistical mechanics there 
is a theorem that the susceptibility matrix is (up to a factor of the temper- 
ature, which we are setting equal to 1) equal to the correlation matrix 

C i:j = ((Si- mi){Sj -rrij)) (9) 

Thus, if we know the correlation matrix, we simply need to invert it and 
solve (for i ^ j) 

(C-%- = -J tj - 2Jf j m i m J (10) 

for Jij (Kappen and Rodriguez, 1998; Tanaka, 1998; Roudi et al, 2009). This 
is the TAP algorithm. There are n(n — 1)/2 quadratic equations to be solved, 
but they are decoupled. For the original, "naive" mean field approximation, 
it is even simpler: The last term on the right hand side of f llOp is absent, and 
we just have = —(C^ 1 )^. 
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3 Kinetic Ising Models 

The Ising model as described so far has no dynamics. It is defined solely 
by the Gibbs equilibrium distribution (JTJ). Since the system we are trying 
to understand is a noisy dynamical one, we would like to fit its data - the 
spike trains - by a stochastic dynamical model. And the Ising model can 
be given a dynamics in a natural way, as follows (Glauber, 1963). At each 
time, every neuron has a chance oc dt of being updated in the infinitesimal 
interval [t, t + dt) (i.e., neuron updates are independent Poisson processes). 
For a neuron that is updated, we compute the total "field" 

H i {t) = h i {t)+Y, J a s A t )- (11) 

3 

(We can allow the external field hi to depend on time.) We then let S% take 
its next value Si(t + At) according to the probability 

(12) 

It is then possible to show that if hi is independent of t and the matrix J is 
symmetric, this model has a stationary distribution given by (00). 

Since we would like to identify the with synaptic strengths, which are 
not symmetric, we relax the symmetry condition. Furthermore, here we will 
update all the neurons simultaneously instead of randomly asynchronous, 
because it makes the model easier to apply to time-binned data, such as we 
are assuming we have here. (We set At = 1 from now on.) With either of 
these changes, the Gibbs equilibrium distribution ([T]) is no longer a stationary 
solution of the dynamics, even if the hi are time-independent. In the case 
that they are and J is symmetric, there is a stationary distribution, 

pr gl = I1J2 cosh(Ej JijSj + hj)} 

(Peretto, 1984). When the hi are time-independent and J is not symmet- 
ric, a stationary distribution may still exist, but it cannot be written down 
in closed from. Since such a state is not described by the Gibbs equilib- 
rium distribution (TI]), we call it a non-equilibrium state, even though it is 
stationary. 
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3.1 Stationary case: exact algorithm 

For simplicity, we consider first the case where the hi are time- independent 
and we assume a stationary distribution of the Si. If we are given a set of 
spike trains in the form S = {Si(t)}, 1 < % < N, we can derive an algorithm 
for finding the model parameters Jij and hi by performing gradient ascent 
on the log-likelihood of the data under the model, which is 

L[S,J,h] = J2[Si(t+ 1)^(1) -\og2 cosh Hi(t)}, (14) 

it 

with Hi(t) given by (TTTT) . This gives learning rules 

Shi = r? + l)) t -(tanhF i (0) t ] (15) 
SJij = 7/[<S , i(t+l)S , i (t)) t -(tanhiZ i (t)S , i (t))J (16) 

(Roudi and Hertz, 2011a). Equations (|T5]) and ([To]) have a form analogous 
to ([2]) and ([3]) for the equilibrium case: the right hand sides are differences 
between averages over the data and averages under the current model. How- 
ever, here the latter can be evaluated directly and quickly from the data, so 
this algorithm is generally much faster than the equilibrium one. 

It is practical in implementing this algorithm to express the neuronal 
firing variables in terms of the differences 5Si(t) = Si(t) — rrii, with = 
(Si(t)) t and to write Hi(t) in the form 

Hi(t) = bi + J2 J ^ S j(t) ( 17 ) 



with 



Then we have 



J2 J a m j- ( 18 ) 



Sk = T] [rrii - (tanhi/j(£)) t ] (19) 
SJ^ = ri {D^ - {[tank H^t)- mi) SSjit))^, (20) 



with 



Dij = (SSi(t+ l)5Sj(t)), (21) 

the one-step-delayed correlation matrix. The factor in brackets in the time 
average in (|2T)|) can be recognized as the expectation value of SSi (t + 1), 
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given the measured Sj(t), so the second term is the one-time-step delayed 
correlation matrix for the model, conditional on the data at t. 

This algorithm is exact, in the sense that for data generated by a kinetic 
Ising model of this form, it will recover the and hi exactly, after infinite 
iterations, for infinite data. (By "infinite data" we mean an infinite number 
of time steps.) 

3.2 Stationary case: mean-field methods 

Like the equilibrium model, this model allows systematic and rapid mean- 
field and TAP approximations (Roudi and Hertz, 2011a). To derive the 
mean-field algorithm, we start by assuming that the rrii (of the data) satisfy 
the mean-field equations (JSJ). Then, on the right-hand side of f fT6|) . we write 
Si(t) = rrii + SSi(t) and expand the tanh to first order in the J^. First-order 
terms in 5S cancel, and we are left with 

8Jij = (6Si(t + l)5S,(t)) - (1 - ml) £ JtoiSSkitySjit)). (22) 

k 

If the Jik are the true ones, then we should have 5Jij = 0. Thus we obtain a 
simple matrix equation 

D = AJC (23) 

where 

C i3 = (5S t (t)5S,(t)) (24) 
Aij = (l-m?)<f - (25) 

The J matrix can be found simply by matrix inversion and multiplication in 
terms of first- and second-order statistics of the data: 

J = A -1 DC _1 . (26) 

Knowing the Jy, the inversion task is then completed by solving the MF 
equations (j5J) for hf. 

hi = tanh -1 m, — V] J^rrij. (27) 

3 

Likewise, we can get a TAP approximation by assuming that the rrii of the 
data satisfy the TAP equations (jB]) and expanding the tanh to third order. 
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After a little algebra, and ignoring some terms which are small for networks 
with weak, dense connections, we again find the formula (I2"3"j) , but where now 

A ij = (l-m?)(l-Fd6 ij , (28) 

with 

F r = (l-m^J2JUl-m 2 k ). (29) 

k 

We cannot solve for the directly in this case, since now Ay depends on 
them through the Fi. One way to do it is by iteration, using, say, the mean- 
field result ( |26|) in (|29|) to get an initial estimate of Fi, using this in ([28~]) to 
get better Jys, and so on. However in the stationary case this iteration can 
be circumvented by a trick: Since we can write as J^ MF /(1 — Fi), we can 
use f l2"9~j) to obtain 

„ (1 - m?) E fc (^ MF ) 2 (i-^ (3Q) 



(l - i^.) 2 

which is a cubic equation that we can solve for Fi, given the mean-field J. 
The relevant root is the one that goes to zero as — > 0. This root cannot 
exceed 1/3, which restricts this method to fairly weak couplings. Still, it is 
an improvement on the mean-field approximation. 

In the same way as for the MF case, once the Jy- are found, we can go 
back to the TAP equations in the form (J7]) to obtain the hi. 

3.3 Nonstationary case: exact algorithm 

Most of the above carries over to the nonstationary case, where the hi 
depend on t. However, now different time bins are not statistically equivalent, 
so we require data from (many) runs of the system. Thus, now we will write 
the spike trains as Si(t, r), where r labels the different runs and t is the time 
index within a run. The "magnetizations" are now t-dependent averages 
over runs, 

rm(t) = (Si{t,r)) r , (31) 

and we define SSi(t, r) = Si(t, r) — rrii(t). 

In (j!4p Si(t + 1) and Hi(t) acquire a run argument r. The learning rules 
analogous to f[T9l and (TSUI) are again straightforward to derive by differenti- 
ating with respect to and J if 

6bi(t) = 7 1 [m i (t + l)- (tanh Hi(t,r)) r ] (32) 

bJ i3 = r)T [A; - ([tanh^(t, r) - rm(t + l)]6S j (t,r)) r J , (33) 
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where T is the run length and 

Dij = + l,r)£S i (t,r)) t , t . (34) 

Note that it is not the same as the (|2"T1) because now the SSi are defined 
relative to the time- dependent run averages rrii(t), not the time-independent 

When we use the word "exact" to describe this algorithm, we mean it 
in the same sense as for the stationary algorithm, but "infinite data" now 
means an infinite number of runs to average over in computing the statistics. 



3.4 Nonstationary case: mean-field methods 

First one needs time-dependent mean-field and TAP equations. These 
have been derived recently (Roudi and Hertz, 2011b), using a systematic 
scheme analogous to that employed by Plefka for the equilibrium case. The 
dynamic TAP equations for the time-dependent magnetizations take the form 



rrii(t + 1) = tanh 



K{t) + E J a m At) - mi(t + 1) £ 4(1 - m)(t)) 



(35) 



For (naive) mean field, the equations are the same except that the last term 
inside the brackets is absent. 

Now the same procedure that led to ( 12"31 gives (Roudi and Hertz, 2011a) 

= £ M(l - mlit + l))C kj {t)) u (36) 

k 

where 

C kj {t) = (6S k {t,r)6Sj{t,r)) r . (37) 
If we define a set of matrices by 

sg = <(l-f^(t + l))C7 w (t)> tl (38) 

we can again solve for J by simple matrix algebra: 

J ij = E^[( Bli) )1fcr (39) 

k 

The calculation is a little more time-consuming than the stationary one, 
because now we have to invert a different matrix for each i. Nevertheless, 
it is still much faster than using the exact algorithm. 



11 



For TAP, in the same way as in the stationary calculation, a correction 
factor 1 — Fi appears. Now Fi is time-dependent, 

F(t) = (1 - m\(t + 1)) £ 4(1 - m 2 fe (t)), (40) 

and the (1 — Fj)-factor enters the time average that defines the matrices B^: 

4? = ((1 - + 1))(1 - m))C kj (t)) t , (41) 

Now there is no simple equation one can solve for Fi(t), as there was in the 
stationary case, so solving for the J-matrix has to be done iteratively. 

However, we have found that qualitatively good reconstruction can be 
done if we make the approximation of factoring the average in (|41[) : 

((1 - m»(t + 1))(1 - F l {t))C kj {t)) t « (1 - Fi)((l - m*(t + l))C kj (t)) t , (42) 

where Fi = (Fi(t)) t . Now Fi solves a cubic equation, as in the stationary 
case: 

m - F?) = E(^ fc MF ) 2 ((l - m*(t + 1))(1 - ml(t))) t , (43) 

k 

and, given Fi, one has Jjj = J™ Mi? /(l — Fj), again as in the stationary case. 
4 Errors 

In a Bayesian way, we can interpret the likelihood expL(J,b) as a prob- 
ability density for the Js and 6s with a flat prior. This density will peak 
around the parameters found by the algorithm, and if the data are generated 
by the kinetic Ising model itself, the peak will lie at their true values in the 
limit of infinite data. The spread of the probability density around the peak 
gives a measure of how the parameters obtained from a single data set will 
vary from one data set to another. Because the log likelihood is proportional 
to the data set size, it is sufficient, for large T (stationary case) or many 
runs (nonstationary case) to expand it to second order in the deviations SJij 
from the values Jo, bo at the maximum, so the resulting density is a Gaus- 
sian. Suppressing, for simplicity, the dependence on the 6s and the overall 
normalization, we find, for the nonstationary case, 

P(J) oc JJexp [L(Jo,b ) - ^RTj^BpJijSJiA , (44) 

i \ jk J 
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where R is the number of runs and the B^ are the matrix elements that 

occurred above (141 p in the mean-field algorithm. Here we can see that KTB$ 
are the elements of the Fisher information matrix of the distribution of Js. 
The factor of RT means that the fluctuations of the Js around their central 
values go to zero like 1 / V RT for large data sets. This makes more precise 
our statement that the exact algorithms described above recover the correct 
model parameters correctly in the limit of infinite data, if those data are 
generated by our kinetic Ising model. 

To get a little more insight, consider the stationary case, where B^l = 
T(l—m1)Cjk- The one can see (1) that J^s with different first (postsynaptic) 
indices are uncorrelated, and (2) J^s with different second (presynaptic) 
indices will have strongly correlated errors if they have large projections 
along eigenvectors of C that have small eigenvalues. Thus, elements of J 
identified as large by the mean field approximation ( 126]) are exactly the ones 
for which the estimate is most uncertain. Finally, if correlations in the data 
are very weak, C is almost diagonal, and 

< ,j w = r( i-4Ta -■*?) ■ (45) 

These considerations show that reconstruction is harder when the data are 
strongly correlated and when rates are low (m; near -1). 

For the mean-field or TAP algorithms, the distribution of Js will have a 
form like ( l4"4"j) . but it will not be centered at the true maximum. The mean 
square errors will be 

(SJiMk) = + (4 - J D(4 - •#")■ (46) 

In cases (such as weak coupling) where the approximations are good, the 
mean square errors will show the 1/RT- dependence for small data sets, but 
for large RT they will level off at a residual value given by the second term 
in PB]). 

If the data come from a different model than our kinetic Ising one, we can 
never obtain its true parameters, even if we use the "exact" algorithms. In 
this case we say that the model that generated the data was "unrealizable" . 
The only "realizable" cases are ones where the model used to make the fit is 
the same as the one used to make the data. When the two models are in some 
sense close to each other and the Js have a meaning in the true model (for 
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example, if we added a term with weak delayed interactions J2j Kij$j{t — 1) 
to (I lip ), we can have a situation like that for the approximate algorithms, 
with errors that first decrease with data set size and then level off at residual 
values. But in general all we can say is that we have found the best kinetic 
Ising model, i.e., the kinetic Ising model most likely to have generated the 
data. 

The real "true" model, i.e., nature itself, is far from our kinetic Ising 
model. Neurons have much more complex dynamics than our model endows 
them with, and synapses have dynamics of their own that we have ignored 
entirely in the present formulation. Thus, in applying our methods to neu- 
ral data, we have to be rather cautious about interpreting the connection 
strengths J^- that we obtain. 

Furthermore, even if we could replace our simple binary units by models 
with realistic dynamics and include suitable models for the synapses, we could 
not expect to recover their parameters correctly from an inference procedure 
like this, because most of the neurons in the true network are not recorded. A 
typical experiment records spikes of about 100 neurons in a region of cortex of 
area about 10 mm 2 , which contains of the order of a million neurons. Hence, 
even ignoring the fact that neurons in this region are coupled to many outside 
it, we have data from only about 0.01% of the local network. It is clearly 
an audacious project to hope to learn about the network connections and 
dynamics from such limited data (hence the resort to terms like "functional 
connectivity" to describe the results). 

On the other hand, we have to start somewhere, and the simple descrip- 
tion of a neuron as a stochastic element that fires at a rate that is a sigmoidal 
function of its net input is at least qualitatively consistent with neurophys- 
iology. While we cannot expect to associate all the JijS we find with direct 
synaptic connections, we would be surprised if the strongest JjjS we found 
were not indicative of synapses between recorded neurons (assuming we had 
chosen the time bin size for our data sensibly, i.e., around the sum of typical 
synaptic and membrane time constants). Our findings (Roudi and Hertz, 
2011a; Sect 10.6.2 below), as well as results obtained applying this approach 
to genetic regulatory network networks (Lezon et al, 2006), support this 
optimism. 

5 Regularization 

The algorithms above can be given a simple little extra twist that allows 
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us to eliminate the smallest (presumably, least significant) connections in the 
network in a controlled way. The idea (Tibshirani, 1996; Ravikumar et al, 
2010) is to try to minimize a cost function 

E = -L[S,lh] + \J2\Ji 1 \- (47) 

In a Bayesian perspective, the addition of this term is the imposition of a 
prior exp(— XJ2 \ Jij\) 011 the Jij. It leads to a new term in the update rules 
([IB"]) . (|20|) and (1331) for proportional to — AsgnJjj. To see its effect, think of 
the space of Js, with an axis for each pair ij. Under learning, the extra term 
pushes the J vector toward the edges of a hyper-octahedron where many of 
its components vanish. The regularization parameter A is chosen small, so 
this is a weak effect for large J^. But it is a big effect for ~ A or smaller, 
and it tends to drive these Js to zero. Thus this modification of the learning 
performs a kind of automatic pruning of the network, the degree of which we 
can control by adjusting A. 

A possibility for another kind of regularization arises in using the non- 
stationary algorithms described above. The source of the time dependence 
of the driving fields hi(t) or fej(t) is the time dependence of the measured 
rrii(t), i.e., of the firing rates. For realistic amounts of data (perhaps 100 
repetitions of the stimulus in an experiment), one cannot expect to estimate 
a rate within a time bin to better than 10% accuracy. The apparent rates 
will fluctuate this much from one time bin to the next, and the inferred hi(t) 
will inherit these fluctuations. If we believe that the true rates vary more 
smoothly than this, we can add a further term to the cost function, 

Esmooth = §« - M* + !)] 2 - (48) 

it 

The resulting extra term in 5hi(t) is proportional to n[hi(t — 1) — 2hi(t) + 
h(t + 1)]. This smoothes hi by averaging hi(t) a little with hi(t ± 1). 

6 Testing the Models 

We have tested these models on three kinds of data: 

(1) Data generated by the models themselves, where we can measure directly 
how well they recover the (known) network parameters. 

(2) Data generated by a semi-realistic computational model of a cortical 
network. This model is much more complex than the kinetic Ising model we 
try to fit its data with, but we still know the true connectivity of the network. 
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(3) Data recorded from a real biological network, in this case of salamander 
retinal ganglion cells. 

6.1 Testing the models on themselves 

Testing the algorithms described above on how well they reconstruct the 
couplings of a kinetic Ising model has two purposes: (1) For the exact algo- 
rithms, to verify that they work according to the theoretical picture set out 
above, in particular, checking that the mean square errors fall off like 1/T 
(stationary case) or (RT)^ 1 (nonstationary case). (2) For the approximate 
algorithms, evaluating the average residual error in (|46| and studying how it 
depends on the strength of the couplings in the model. We summarize the 
findings (Roudi and Hertz, 2011a) here. 




(a) (b) 



Figure 1: Mean square errors for nMF (a) and TAP (b) approximations, 
stationary kinetic Ising model, as a function of the number of time steps T 
in the training data for noise levels g = 0.1 (*), 0.12 (+), 0.14 (circles), and 
0.16 (x). The solid lines are the theoretical predictions from equations (52) 
(nMF) and (53) with finite-size corrections (TAP), and the symbols are the 
errors measured for the Js obtained from the algorithms. The dashed line is 
the rms error 1/T for the exact algorithm, (adapted from Roudi and Hertz, 
2011) 

Fig. 1 shows mean square errors in the Js for a model in which the 
Jij are chosen independently from a Gaussian distribution of mean zero and 
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standard deviation g/yN, where N is the number of neurons in the network. 
Here we have taken N = 20 and all hi = 0. The nMF and TAP errors follow 
the exact-algorithm result 1/T for small T and then flatten out at minimum 
values (higher for nMF that TAP). 

The asymptotic nMF and TAP errors are systematic: For infinite data, 
nMF systematically underestimates the magnitude of the and TAP over- 
estimates them (though less so). In fact, the factor 1 — F$ relating the TAP 
and nMF Jys is the lowest-order correction of the nMF underestimation. We 
can see this simply, for zero field, by expanding the tanh in (fl6|) or (|20|) at 
8 Jij = to third order in the Js (instead of just to linear order as we did in 
022])). We get 

Dij = JikCkj — | 51 JikJilJim(SkSlS m Sj) + • • • . (49) 
k klm 

(All correlations here are equal-time.) In the sum over k, I and m, terms 
with k = I, I = m and m = k dominate. Multiplying on the right by the 
inverse of C and using the nMF result (1261) . we find 

JT F = Ja - (E 4) Ja (50) 

(plus corrections of relative order 1/N). Now we can use the fact that for 
our SK-like model, the quantity J2k Jfk = 9' % again ignoring corrections of 
higher order in 1/N, so we obtain 



TnMF 

J« = j^. (51) 



The denominator can be recognized as the TAP correction factor 1 — Fi when 
the hi = 0. This implies an asymptotic mean square error 

e„ MF = ((J« - Jf F ) 2 ) = £ (52) 

The solid curves in Fig la are 1/T + e n MF, evidently a quite good fit for the 
values of g from 0.1 to 0.16 for which they are plotted. The black dots in Fig 
2 show the nMF Js plotted against the true ones. When the size of the data 
set is increased from 10 4 to 10 6 , the scatter is reduced and the systematic 
underestimation becomes clear. In the limit of infinite data, we expect the 
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Figure 2: J's obtained by nMF (circles) and TAP (crosses) algorithms, plot- 
ted against the true Js for a network of 20 neurons with g = 0.35. (a) 
T = 10 4 ; (b) T = 10 6 . 

scatter to disappear, leaving a clean linear behavior near = with a slope 



The TAP errors shown in Fig lb are much smaller that the nMF ones, 
because they include the lowest-order correction (l5Tj) . We can find the leading 
error in the TAP approximation by expanding the tanh in ([16]) to fifth order. 
A little algebra like that leading to ( 1521) leads to an asymptotic error estimate 



Furthermore, one can show that TAP systematically overestimates the mag- 
nitudes of the Js, in the same way that nMF underestimated them. For a 
large network, we expect this to describe the asymptotic TAP error. However, 
there are finite-size corrections, of order g 6 /N 3 . These would be negligible 
for N ^> 1/g 2 , but they are not for the small network (N = 20) at the values 
of g studied here. However, taking both ( 1531) and the finite-size corrections 
into account, we are able to account for the measured errors (Fig. lb). 

In Fig 2 one can see that the TAP Js (square markers) lie much closer to 
the true ones than the nMF Js (*) do, but a careful look at Fig 2b reveals 
the systematic overestimation. 
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tTAP — 



N 



(53) 
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Finally, we give an example of nonstationary inference. We generated R = 
100 runs, each of length T = 10 5 steps for a model with hi = 0.5 cos(27rt/100). 
The couplings are small enough (g = 0.05) that nMF is quite accurate, as 
Fig 3a shows. Furthermore, the fields are also quite well recovered (dots, Fig 
3c) by solving fl35l) (without the TAP term) for hi(t). 




100 150 200 250 300 



(c) 

Figure 3: Nonstationary inference: 20 neurons, g = 0.05, driven by a sinu- 
soidal field with period 100 steps, (a) The couplings inferred using the non- 
stationary nMF algorithm are almost equal to the true ones, (b) Couplings 
inferred using the stationary algorithm are systematically overestimated, (c) 
The nonstationary algorithm recovers the driving field (solid curve) correctly 
(dots), while the amplitude of the field predicted from the mean field equa- 
tions using the couplings found by the stationary algorithm (crosses) is too 
small, (adapted from Roudi and Hertz, 2011) 

The Js obtained if one does the calculation assuming stationarity are sys- 
tematically too large (Fig 3b). This is because of the apparent correlations 
induced in the data by the common external field. In calculating correlations 
we should actually compute (SSi(t)SSj(t')) r using 5Si(t) = Si(t) — m;(t), 
with time-dependent m>i(t). However, if we assume stationarity and use 
5Si(t) = Si(t) — (rrii(t))t, instead, we will overestimate the correlations. (This 
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is sometimes called "stimulus-induced correlations".) These spurious correla- 
tions (all positive, in this case) then lead to a spurious increase in the inferred 
JijS. One can use these spurious Js in fl35l) (again, here with the TAP term 
absent) to infer hi(t), but because the Js are too big, the resulting hi(t) do 
not have a modulation amplitude as large as they should (Fig 3c, crosses). 

6.2 Application To Data From Model Cortical Network 

For a first nontrivial test of the approach, we try it on data generated by a 
realistic model cortical network (Hertz, 2010). This is a network of 1000 spik- 
ing neurons, 80% excitatory and 20% inhibitory, with Ho dgkin- Huxley- like 
intrinsic dynamics. They are driven by an additional external population of 
800 Poisson neurons firing tonically at 10 Hz. All neurons, both internal and 
external, are connected by conductance-based synapses in a random fashion, 
with a connection probability of 10% (except that the external neurons are 
not connected to each other). The strengths of these synaptic conductances 
were chosen so that the network was in a high- conductance balanced state 
(Amit and Brunei, 1997; van Vreeswijk and Sompolinsky, 1998) with average 
conductances in the range measured experimentally (Destexhe et al, 2003). 

The magnitudes of the synaptic conductances did not vary within a class 
(excitatory-to-excitatory, excitatory-to-inhibitory, etc.), except for the fact 
that 90% of them were zero because of the random dilution, though they did 
vary somewhat in their temporal characteristics. For such a system, a natural 
and important question to ask is how well we can identify the connections 
that are actually present. 

We choose for analysis the spike trains of the inhibitory neurons with 
rates over 10 Hz. There were 95 of them. Their mean rate was 32 Hz, 
with a maximum of 83 Hz. Higher firing rates give better statistics, and the 
inhibitory synapses are the strongest ones in this model, so this choice makes 
the task of fitting the model and identifying the connections present easier 
than it might otherwise be. However, it is useful to test the method on an 
easier problem before trying to solve a more difficult one. The data were put 
into the form to be used by the algorithms by binning the spikes with 10-ms 
bins. 

To study how much data is necessary for the fit, we did the analysis using 
the exact stationary algorithm for data sets of size T from 1000 up to 64000 
time bins. For each such T, we calculated the log likelihood of the data. 
For comparison, we also calculated these numbers for independent-neuron 
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models based on the same partial data sets. The results were adjusted with 
Akaike corrections equal to the number of parameters in the models (Akaike, 
1974). 

Fig 4 shows the log likelihoods (per time step, per neuron), with and 
without Akaike corrections. They appear to converge to a value of — 0.306 ± 
0.001, and evidently there is little point in using more than around 50000 
time steps of data. Also shown are the log likelihoods for an independent- 
neuron model (all = 0). For these, the Akaike correction is so small that 
the corrected and uncorrected curves would fall almost on top of each other, 
so only the adjusted values are actually plotted. The curve is nearly flat at 
at value of about -0.53. It is evident that the model with Js is much better 
than the one without them. 



-0.24 




T 



Figure 4: Cortical model data (tonic firing): Log likelihoods as a function 
of data set size T for fits with kinetic Ising and independent-neuron models. 
Dashed line: log likelihood. Solid line: Akaike-adjusted log likelihood. Dot- 
ted line: independent-neuron model (Jy = 0). Despite its greater number of 
parameters, the model with couplings is clearly better, and the difference is 
evident for all the data set lengths in the range shown. 

The Akaike-adjusted log likelihood is a suitable statistic for comparing 
model quality, but one does not have an clear idea of what a value of -0.306 
means for the quality of the network reconstruction, in particular for the 
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problem of identifying the connections present in this diluted network. To 
do this, we use another statistic, defined as follows. 
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Figure 5: Distributions of connection strengths found for cortical model net- 
work data for inhibitory connections present in the network (solid line) and 
connections not present in the network (dashed line). Averages of results of 
performing nMF inference on 30 sets of 50 neurons chosen randomly from 
the 95 inhibitory neurons with rates > 10 Hz. 

Consider the values of J^- that the algorithm finds for the connections that 
are actually present in the network. These estimates will have some spread 
around their mean value, because of the limited data and the mismatch 
between the kinetic Ising model and the real network. For the same reason, 
the values assigned to the JijS for which the connections {ij} are not present, 
which should be zero, will also be spread around their (small) mean value 
(Fig 5). If the spreads of these two distribution are small compared to the 
difference between their means, we can easily identify the true connections, 
even if we do not know which ones they are a priori. A measure of the 
difficulty of the task which we can compute (knowing the true connectivity) 
is the noise/signal ratio 

, 0"(^truc 7^ 0) + 0-(J t ruc = 0) 

d = ~uT\ 7T\ T ' ( 54 ) 
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where the standard deviations and means are over the Js found by the al- 
gorithm. Fig 6 shows d as a function of data set size T, calculated for the 
Js obtained by both the exact algorithm and the nMF and TAP approxi- 
mate formulae. The TAP and nMF results are too close to each other to 
distinguish, so only the latter is plotted. The plot in Fig 5 was made using 
TAP results for T = 200000 and shows the kind of errors one makes for 
d ~ 0.47: The false positive rate is 5.6% and the false negative rate is 7.2%. 
For d ~ 0.3, which one achieves with the exact algorithm, there are almost 
never any errors. 




T 



Figure 6: Fitting cortical network data: Noise/signal ratio d, as function of 
data set length T for nMF (dashed) and exact (solid) learning algorithms. 

Note, furthermore, that this reconstruction was achieved for a strongly 
undersampled network - There were 1000 neurons in the network but the 
reconstructions were done using only 50 of them at a time. This lends some 
support to our hope that reconstruction, at least for strong synapses, might 
be possible even though we cannot record from all the neurons in the network. 

In this example, we used the stationary versions of the algorithms be- 
cause the rate of the external driving population was constant. We have also 
studied where the external rate varies in time, performing the corre- 

sponding nonstationary analysis, which gives similar results. Results will be 
published elsewhere. 
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6.3 Application To Data Prom Salamander Retina 

We have also analyzed a data set provided by Michael Berry of recordings 
from 40 ganglion cells in a salamander retina. The retina was stimulated 120 
times by a 26.5-s movie clip (a total recording time of 3180 s). A nonsta- 
tionary treatment is natural, in view of the time- dependent stimulus. We 
expect a significant contribution to the connections identified in a stationary 
analysis from stimulus-induced correlations, as in Fig 3b. 

Such data have been analyzed in the past using the Gibbs equilibrium 
analysis described in Section 2 (Schneidman et al 2006), which assumes sta- 
tionarity. It is interesting to see whether the J^s obtained that way are 
related to those obtained using a stationary kinetic Ising model. They are 
not expected to be the exactly the same, since the Gibbs Js are determined 
solely by the equal-time correlations and the kinetic Ising ones depend on 
the one-time-step delayed correlations. Nevertheless, we can hope that if we 
have chosen the time bin size sensibly, the connections identified by the two 
methods will be similar. 

To find out, we fit both the Gibbs model and the stationary version of 
our kinetic Ising model to these data. Fig. 7a shows a scatter plot of the 
resulting two sets of J^s. While they are not so similar as to be concentrated 
along a straight line in the plot, they do tend to have the same sign (most of 
the points are in the first or third quadrants). 

The kinetic Ising model allows for an asymmetric J matrix, while sym- 
metry is enforced in the Gibbs model. To see how asymmetric the kinetic 
Ising J matrix is here, we made a scatter plot of the elements of J against 
those of its transpose (Fig 7b). Evidently, the magnitudes of J^- and Jji can 
be different, but they do not differ wildly, and they almost always agree in 
sign. This is consistent with the fact that they agree qualitatively with the 
fully symmetric Js found for the Gibbs model and with our expectation that 
they may be the result, to a strong degree, of stimulus-induced correlations, 
which are symmetric by definition. 

To find out to what degree, we then performed a nonstationary analysis, 
with time- dependent bi(t) in our kinetic Ising model. In addition, we made 
a fit for a model (also nonstationary) with independent neurons, in which all 
time dependence of firing rates is explained in terms of the time-dependent 
6s. 

We compared these three models on the basis of the log likelihoods of 
the data under them, with Akaike corrections for the different numbers of 
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(a) (b) 

Figure 7: Fig 7. Salamander retina data: (a) Couplings found from the 
stationary kinetic Ising model algorithm plotted against those obtained from 
the Gibbs equilibrium algorithm, (b) Testing for the degree of symmetry 
in couplings: elements of the coupling matrix J plotted against those of its 
transpose. 

model parameters. We found that the stationary model (Akaike-adjusted 
log likelihood per neuron per time bin —0.128 bits) was significantly worse 
than the nonstationary ones (—0.0927 bits for the independent model and 
—0.0906 bits for the full model). The difference between the fits given by the 
nonstationary models with and without J^s is quite small (0.0021 bits, only 
2.4% of the total log likelihood). A full description of this analysis will be 
published elsewhere. 

We conclude that essentially all the connections found for the stationary 
and Gibbs models come from stimulus-induced correlations. Once we include 
time-dependent 6s in the model, adding Js gives almost no improvement in 
the fit. Although it is disappointing not to be able to identify any important 
intrinsic connections in this network, the result illustrates how the nonsta- 
tionary analysis can uncover features of the system that stationary models 
(including the Gibbs model) can not. 

7 Further Developments 
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In all the above, we restricted the treatment to the simplest kind of 
kinetic Ising model, for pedagogical purposes. It is not hard to extend the 
model in various ways to bring it closer to neurobiological reality. The most 
obvious way is to modify the memory-less dynamics (11 1111 2p by letting the 
firing probability at t + 1 depend on spikes earlier than t: 

H .(t) = h .(t) + £ J^Sjit -8 + 1). (55) 

Thus, each connection in the model is characterized by a temporal kernel 
Jij(s). It describes synaptic and membrane potential dynamics. It is straight- 
forward to derive exact and mean- field learning algorithms for such a model. 

Models of this kind, called generalized linear models (GLMs), have been 
studied in recent years (Truccolo et al, 2004; Okatan et al, 2006). They were 
used (Pillow et al, 2008) in an extensive study of the signaling by a population 
of 27 monkey retinal ganglion cells. They have also been used (Rebesco et al, 
2010) to track changes in connection strengths in a network in which synaptic 
plasticity was induced by microstimulation. Thus, the feasibility and utility 
of such solutions of the inverse problem for real neural networks have been 
demonstrated. The simple cases described in this chapter can be thought 
of as "poor man's" GLMs. Since they have fewer parameters, they may be 
particularly useful models when data are limited. It may also be useful to 
extend the mean-field methods discussed here to GLMs. 

Another couple of directions in which further development of these models 
would be useful are the following: 

(1) As mentioned above, one never records from all the neurons in a 
network. A systematic approach to the inference problem in the presence 
of "hidden units" is needed. This problem could usefully be explored first 
for the simplest model (Illtil2p in a realizable case, to gain some knowledge 
about how to model the hidden part of the population. 

(2) Even the model with temporal synaptic kernels assumes "current- 
based" synapses, i.e., that a presynaptic spike causes a particular current 
to flow in or out of the postsynaptic neuron, independent of the state of 
that neuron. Actually, what the presynaptic spike causes is the stochastic 
opening and closing of a channel, selective for particular ions, in the post- 
synaptic membrane. The synaptic current (or, more precisely, its average 
over many presynaptic transmitter release events) is then the product of the 
conductance of the channel if it were open, the time-dependent probability 
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that it is open, conditional on the presynaptic spike, and the difference be- 
tween the instantaneous postsynaptic membrane potential and the reversal 
potential for that channel. The difference between such "conductance based" 
synapses and current-based ones may be quantitatively small for excitatory 
synapses, since their reversal potentials are far above the firing threshold, 
but they are not for inhibitory ones, whose reversal potentials are not much 
lower than typical subthreshold membrane potentials. It would be desirable 
and interesting to extend the kind of modeling here to include a little of 
this potentially relevant biophysics, starting perhaps with the limiting case 
of "shunting inhibition". 

(3) Finally, we mention a new mean-field inversion algorithm (Mezard 
and Sakallariou, 2011; Sakallariou et al, 2011) which reconstructs kinetic 
Ising models exactly when the J matrix is completely asymmetric. It would 
be interesting to see how well it works on data from more realistic models or 
experiments. 
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